This course - Introduction to Open Data Science - makes use of the open research tools with open data and how to apply data-driven statistics.
Link to my project: https://github.com/SusannaSuntila/IODS-project
date()
## [1] "Mon Dec 5 16:36:15 2022"
I am excited and I only hope that I can keep up with the pace and content of this course, but it all sounds very interesting. I have taken some courses in R, but I have not used it professionally or for that long. I want to become even more familiar in using R and I want to deepen my skills in visualizing, sharing, wrangling, testing and analyzing data. I heard about this course in my previous course for research skills.
date()
## [1] "Mon Dec 5 16:36:15 2022"
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## Warning: package 'ggplot2' was built under R version 4.2.2
## Warning: package 'purrr' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
# the data-set that I created
learning2014 <- read.csv(file.path(".", "data", "learning2014.csv"))
# Look at the dimensions of the data
dim(learning2014)
## [1] 166 7
# Look at the structure of the data
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
There are 166 rows (observations) and 7 columns (variables) in this data. The gender variable is a character, the other variables are numbers or integers.
The original data is from a questionnaire that tracked different learning approaches and achievements on an introductory statistical course at university level. Gender and age are self-explanatory. The points variable tells how many exam points the person had. The attitude variable is a sum of 10 questions related to student’s attitude towards statistics, in a 1-5 scale. The remaining variables deep, stra and surf are combination variables that have been combined from multiple questions with the same dimension. The variables are averages of the selected questions concerning deep, surface and strategic learning.
library(ggplot2)
# set the theme white
theme_set(theme_bw())
# plot of student's attitude and points
ggplot(learning2014, aes(x = attitude, y = points, col = gender)) +
geom_point() +
geom_smooth(method = "lm") +
scale_colour_manual(values = c("lightpink2", "maroon")) +
labs(title = "Student's attitude compared to exam points",
subtitle = "learning2014 data")
## `geom_smooth()` using formula = 'y ~ x'
library(GGally)
## Warning: package 'GGally' was built under R version 4.2.2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
# draw a scatter plot matrix of the variables in learning2014.
# [-1] excludes the first column (gender)
pairs(learning2014[-1])
# add color by gender
gender_col <- c("lightpink2", "maroon")[unclass(factor(learning2014$gender))]
pairs(learning2014[-1], pch = 19, col = gender_col)
# create a more advanced plot matrix with ggpairs()
ggpairs(learning2014, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
# with color by gender
ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20))) +
scale_colour_manual(values = c("lightpink2", "maroon")) +
scale_fill_manual(values = c("lightpink2", "maroon"))
First, there are clearly more women who have participated in this questionnaire than men. When looking at the distribution of the variables, it is clear that age is concentrated below thirty, with a long tail towards right. With the age variable, the skew is larger than with any other variable. There is also more variation between men compared to women. Interestingly with the attitude variable, there is more variation with women compared to men. Men also have slightly higher mean, and their distribution is more skewed. The deep variable has a longer tail on the left side, so the observations are more concentrated on the higher side. The difference between genders is very small in this case. The stra variable has a wide distribution. The surf variable is more concentrated among women compared to men. The point variable has a thick tale among lower scores and the distribution is more concentrated among higher scores.
The largest and most significant correlation can be found with attitude and points. It is positive, so better attitude is related to better points in the exam. The lowest correlation is between deep learning and points from the exam which is interesting as well. Surf variable is also significantly correlating with attitude and deep.
library(GGally)
library(ggplot2)
# draw a plot of the linear relation of exam points and attitude
qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## `geom_smooth()` using formula = 'y ~ x'
# create a regression model with multiple explanatory variables
model1 <- lm(points ~ attitude + gender + stra, data = learning2014)
# print out a summary of the model
summary(model1)
##
## Call:
## lm(formula = points ~ attitude + gender + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.7179 -3.3285 0.5343 3.7412 10.9007
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9798 2.4030 3.737 0.000258 ***
## attitude 3.5100 0.5956 5.893 2.13e-08 ***
## genderM -0.2236 0.9248 -0.242 0.809231
## stra 0.8911 0.5441 1.638 0.103419
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.304 on 162 degrees of freedom
## Multiple R-squared: 0.2051, Adjusted R-squared: 0.1904
## F-statistic: 13.93 on 3 and 162 DF, p-value: 3.982e-08
In this model the dependent variable is exam points as instructed. The three explanatory variables that I chose are attitude, gender and stra. In the summary of the model it can be seen that only the intercept and the variable attitude are statistically significant, so attitude is the only variable strongly associated with exam points in this particular model. The coefficient on attitude, 3.5, tells that one unit change in attitude is related to a 3.5 point increase in exam points, conditional on other variables of this model. The gender variable is a dummy variable that has the value 1 when the person’s gender is male, so the coefficient of the gender variable describes the difference between the two genders, and when the person is male, it lowers the exam points by -0.22, again conditional on the other variables of this model. The stra variable that is computed from strategic questions has a coefficient of 0.89, meaning that better strategic learning will increase the exam points. The F-test that all three coefficients would be zero has a very small p-value, so it can be rejected.
As the variables gender and stra were not statistically significant, I dropped the gender variable first in the next model.
# create a second regression model
model2 <- lm(points ~ attitude + stra, data = learning2014)
# print out a summary of the model
summary(model2)
##
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
The attitude variable is still statistically highly significant. The attitude variable is positively related to the exam points, with 3.5 increase with one unit change in attitude for the better conditional on the stra variable. The intercept of this model means that if attitude would be 0, exam points would be 11.6 according to this model. What is interesting, is that when gender variable is removed from the model, the stra variable becomes significant at 10 percent level. It has a coefficient 0.9, meaning that one unit increase in strategic learning is related to a 0.9 increase in exam points, conditional on the attitude variable.
The multiple R-squared is 0.20, which means that the two variables attitude and stra together explain 20 percent of the variation in exam points.
Compared to the last model, the F-statistic has risen as well, and the p-value is smaller.
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
par(mfrow = c(2,2))
plot(model2,which = c(1,2,5))
The first plot that compares the residuals with the fitted values shows that the points are spread around quite randomly, except for a few outliers, so the model is pretty appropriate. The plot does not indicate non-linear trends or non-constant variance.
The Q-Q-plot shows that the residuals are somewhat normally distributed. However, the lower tail is more heavy, so the values are larger there than would be expected and the upper tail is lighter, so values are smaller there than expected.
The residuals vs leverage plot shows if there are any outliers that would be significant for the model. The Cook’s distance curved lines don’t show in this plot, so the outliers aren’t too disturbing and there aren’t any points that would have high residuals and too much leverage at the same time.
date()
## [1] "Mon Dec 5 16:37:21 2022"
library(tidyverse)
# read in the data that was created
alc <- read.csv(file.path(".", "data", "alc.csv"))
# print names of the variables
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
This data set is constructed from a secondary school student questionnaire of Portuguese schools. As the names of the variables indicate, the questions cover topics such as school and studies, but have also social and demographic aspects. The alc data set is combined from two data sets dealing with the students’ performance in math and Portuguese language.
The binary variable address is an interesting one, I would predict that when living in urban areas high alcohol use would be more probable, as there are more options to use alcohol than for students living in rural areas.
Another variable that I want to look at is the studytime one (weekly study time, divided into 4 groups, where 4 is the highest amount spent in studying), to see if spending more time studying is negatively related to high alcohol use, as the student is spending more time with studies and is possibly more committed to school.
I would hypothesize also that goout variable, telling how much the student goes out with friends (1-5 scale where 5 is very high) is positively linked to high alcohol use, as alcohol is usually a socially related issue.
Lastly I wanted to include the more continuous variable absences, which measures the number of school absences (0-93). I would hypothesize that more absences are positively linked to high alcohol use, as drinking a lot might affect the student’s attendance at school.
First the address variable
library(ggplot2)
# set the background white
theme_set(theme_bw())
# bar plots of address variable
g1 <- ggplot(data = alc, aes(x = address, fill = high_use)) +
geom_bar() +
theme(legend.position = "none") +
scale_fill_manual(values = c("paleturquoise3", "paleturquoise4"))
g2 <- ggplot(data = alc, aes(x = address, fill = high_use)) +
geom_bar(position = "fill") +
ylab("proportion") +
scale_fill_manual(values = c("paleturquoise3", "paleturquoise4"))
# combine the two plots
library(patchwork)
g1 | g2
As can be expected, most of the students live in urban areas, and so more people have high use of alcohol levels in urban areas, but contrary to what I thought relatively more people have high use of alcohol in rural areas.
Next the studytime variable:
# studytime variable
g3 <- ggplot(data = alc, aes(x = studytime, fill = high_use)) +
geom_bar() +
theme(legend.position = "none") +
scale_fill_manual(values = c("plum3", "plum4"))
g4 <- ggplot(data = alc, aes(x = studytime, fill = high_use)) +
geom_bar(position = "fill") +
ylab("proportion") +
scale_fill_manual(values = c("plum3", "plum4"))
# combine the two plots
library(patchwork)
g3 | g4
As I thought, less studytime has higher proportion of students who have high use of alcohol. This is clear for both absolutely and proportionally.
Next the goout variable:
# the going out variable
g5 <- ggplot(data = alc, aes(x = goout, fill = high_use)) +
geom_bar() +
theme(legend.position = "none") +
scale_fill_manual(values = c("palevioletred3", "palevioletred4"))
g6 <- ggplot(data = alc, aes(x = goout, fill = high_use)) +
geom_bar(position = "fill") +
ylab("proportion") +
scale_fill_manual(values = c("palevioletred3", "palevioletred4"))
# combine the two plots
library(patchwork)
g5 | g6
Here the high use of alcohol increases with the amount of going out with friends, and the pattern is very clear.
And lastly the absences variable:
# absences
g7 <- ggplot(data = alc, aes(x = absences, fill = high_use)) +
geom_bar() +
theme(legend.position = "none") +
scale_fill_manual(values = c("pink3", "pink4"))
g8 <- ggplot(data = alc, aes(x = absences, fill = high_use)) +
geom_bar(position = "fill") +
scale_fill_manual(values = c("pink3", "pink4")) +
ylab("proportion")
# combine the two plots
library(patchwork)
g7 | g8
Here it is also clear that increase in absences is linked to an increase in high use of alcohol as well, but there are also some outliers with few students who have a very high amount of absences.
Looking at the absences variable closer with a box plot:
# box plot of absences with color by address
ggplot(data = alc, aes(x = high_use, y = absences, col = address)) +
geom_boxplot() +
scale_color_manual(values = c("violetred", "violetred4"))
The median of absences is the same whether student is form rural or urban area when there is no high use of alcohol, but the median for absences is slightly higher for students from rural areas if they belong to the high use of alcohol group.
Next few cross-tabulations. First I wanted to add a table that includes address, the mean of going out and of course high use of alcohol.
# table of address, high use of alcohol and the mean of going out
alc %>% group_by(address, high_use) %>% summarise(count = n(), mean_goout = mean(goout))
## `summarise()` has grouped output by 'address'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 4
## # Groups: address [2]
## address high_use count mean_goout
## <chr> <lgl> <int> <dbl>
## 1 R FALSE 50 2.6
## 2 R TRUE 31 3.52
## 3 U FALSE 209 2.91
## 4 U TRUE 80 3.81
The mean of going out increases quite a lot when the alcohol use is high for both rural and urban area students. The mean of going out in both cases of alcohol use is higher for students who live in urban areas compared to rural area students, as would be expected.
Another cross-tabulation of the mean of studytime and mean of absences with regards to alcohol use:
# table of address, high use of alcohol and the mean of going out
alc %>% group_by(high_use) %>% summarise(count = n(), mean_studytime = mean(studytime), mean_absences = mean(absences))
## # A tibble: 2 × 4
## high_use count mean_studytime mean_absences
## <lgl> <int> <dbl> <dbl>
## 1 FALSE 259 2.16 3.71
## 2 TRUE 111 1.77 6.38
Those students who have high use of alcohol also have lower mean of studytime and higher mean of absences compared to those who do not have high use of alcohol.
# simple model with only the absences as an explanatory variable
model <- glm(high_use ~ absences, data = alc, family = "binomial")
#summary of the simple model
summary(model)
##
## Call:
## glm(formula = high_use ~ absences, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3598 -0.8209 -0.7318 1.2658 1.7419
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.26945 0.16094 -7.888 3.08e-15 ***
## absences 0.08867 0.02317 3.827 0.00013 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 434.52 on 368 degrees of freedom
## AIC: 438.52
##
## Number of Fisher Scoring iterations: 4
####
# model with all the chosen variables: absences, studytime, goout and address
model1 <- glm(high_use ~ absences + studytime + goout + address, data = alc, family = "binomial")
# print out a summary of the model
summary(model1)
##
## Call:
## glm(formula = high_use ~ absences + studytime + goout + address,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9982 -0.7761 -0.4950 0.8461 2.3232
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.92997 0.56532 -3.414 0.000640 ***
## absences 0.06687 0.02221 3.011 0.002600 **
## studytime -0.59099 0.16862 -3.505 0.000457 ***
## goout 0.75392 0.12061 6.251 4.09e-10 ***
## addressU -0.73038 0.30408 -2.402 0.016308 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 371.44 on 365 degrees of freedom
## AIC: 381.44
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(model1)
## (Intercept) absences studytime goout addressU
## -1.92997318 0.06687033 -0.59099384 0.75391575 -0.73038272
# compute odds ratios (OR)
OR <- coef(model1) %>% exp
# compute confidence intervals (CI)
CI <- confint(model1) %>% exp()
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.1451521 0.0466309 0.4305313
## absences 1.0691568 1.0250338 1.1197230
## studytime 0.5537766 0.3936409 0.7639234
## goout 2.1253059 1.6882118 2.7118282
## addressU 0.4817246 0.2648416 0.8757487
I looked at first a simple model, where absences was the only explanatory variable, and the AIC was 438.52. After adding the three other variables, the AIC had decreased to 381.44, so adding the other variables has improved the model.
When looking at the summary of the fitted model with all the chosen variables, all the coefficients are statistically significant at the 5 % level. As the exploration of the variables earlier implied, the relationship is positive with absences and goout, so increase in these variables will increase the odds of the student having high use of alcohol. Studytime and address on the other hand decrease the odds of having high use of alcohol.
Looking at the coefficients more closer, absences has an odds ratio of 1.069, meaning that a unit increase in absences increase the odds of the student having high use of alcohol by 6.9% keeping other variables constant, and the confidence interval is between 2.5% and 11.97%. This isn’t as large an effect, as I might have expected.
Studytime has a coefficient that is less than one, 0.55, so when studytime increases by one more unit, it decreases the odds of the student having high use of alcohol by 45% when other variables are constant, and the 95% CI is 61% - 24% decrease in odds. Studytime has an opposite effect when increasing, compared to absences.
The goout variable has an odds ratio of 2.125, meaning that a unit increase in goout increases the odds of the student having high use of alcohol by 112.5% again adjusting for the other variables in the model. The 95% CI is 68.8% - 171%. So going out has quite a large effect, which was apparent when plotting it in the beginning.
The address variable has an odds ratio of 0.48, meaning that if the student lives in an urban area, it decreases the odds of the student having high use of alcohol by 52%, when adjusting for the other variables. The negative effect has a 95% CI between 74% - 13%. This is quite surprising as in the beginning I wasn’t even sure of the direction of the effect.
First a 2x2 cross tabulation of predictions versus the actual values:
# fit the model
model1 <- glm(high_use ~ absences + studytime + goout + address, data = alc, family = "binomial")
# predictions
library(dplyr)
alc <- mutate(alc, probability = predict(model1, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 237 22
## TRUE 63 48
# initialize a plot of 'high_use' versus 'probability' in 'alc'
ggplot(alc, aes(x = probability, y = high_use, col = prediction)) +
geom_point() +
scale_color_manual(values = c("skyblue2", "skyblue4"))
This model predicts high use incorrectly for 22 students, and low use incorrectly for 63 students.
Next the the total proportion of inaccurately classified individuals, the training error.
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2297297
The training error is 0.2297, so on average ~23% of the predictions are wrong. The training error is below 0.5, so it is better than randomly guessing.
#K-fold cross-validation, K=10
library(boot)
## Warning: package 'boot' was built under R version 4.2.2
cv <- cv.glm(data = alc, cost = loss_func, glmfit = model1, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2405405
One set of the 10-fold cross-validation gives the prediction error of 0.2378, so 23.78% of the predictions are wrong. This is slightly smaller than the model in the exercise set had, 0.26, so the model that I have used has a bit better test set performance.
# access the MASS package
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:patchwork':
##
## area
## The following object is masked from 'package:dplyr':
##
## select
# load the data
data("Boston")
# explore the dataset
dim(Boston)
## [1] 506 14
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
There are 506 observations and 14 variables in this data set from MASS-package. The data is formed from housing values in suburbs of Boston. For example the variable crim tells the per capita crime rate by town.
library(GGally)
library(ggplot2)
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.2.2
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(tidyr)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.2.2
## corrplot 0.92 loaded
#summary of the variables
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# draw a scatter plot matrix of the variables
pairs(Boston)
# the distribution of all the variables
ggplot(data = melt(Boston), aes(x = value)) +
stat_density() +
facet_wrap(~variable, scales = "free")
## No id variables; using all as measure variables
# correlations between the variables
cor_matrix <- cor(Boston) %>%
round(., digits = 2)
corrplot.mixed(cor_matrix)
Summaries of the variables show that the scale varies a lot for the variables.
When looking at the distributions of the variables, only the variable rm (average number of rooms per dwelling) is close to normal distribution and medv (median value of owner-occupied homes in 1000s) is somewhat normal. The variable of interest, crim, has most of the observations at the left tail as does the variable zn (proportion of residential land zoned for lots over 25,000 sq.ft). The reverse is true for the variable black (1000(Bk−0.63)^2 where Bk is the proportion of blacks by town). There are also few variables that have two peaks at their distribution; indus (proportion of non-retail business acres per town), rad (index of accessibility to radial highways) and tax (full-value property-tax rate per $10,000). Rest of the variables are skewed to left or right.
Finally, when looking at the correlation plot, crim correlates (positively) the most with rad. Largest correlations overall can be found with nox and dis (-0.77) and with rad and tax(0.91). It seems like indus, nox, dis, rad and tax correlate the most with other variables. On the other hand the variable chas (Charles River dummy variable) doesn’t really correlate with any variable.
First standardize the data set
# use the scale function
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crim <- as.numeric(boston_scaled$crim)
# summaries of the scaled data
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
The initial data had very wide range of values for each variable and the sclales varied a lot depending on the variable, so standardizing has normalized the range of the values.
Next the creation of a factor variable form the crim variable:
# create a quantile vector of crim
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, labels = c("low", "med_low", "med_high", "high") ,include.lowest = TRUE)
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
And finally the creation of train and test sets:
# Divide the dataset to train and test sets, so that 80% of the data belongs to the train set
ind <- sample(nrow(boston_scaled), size = nrow(boston_scaled) * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2450495 0.2599010 0.2524752 0.2425743
##
## Group means:
## zn indus chas nox rm age
## low 1.00804581 -0.9390086 -0.11325431 -0.9175307 0.44822600 -0.9056883
## med_low -0.07378209 -0.3663726 -0.04735191 -0.5962841 -0.13180125 -0.4177669
## med_high -0.40148591 0.1686282 0.15226017 0.3770968 0.03067373 0.3200631
## high -0.48724019 1.0171960 -0.11163110 1.0842613 -0.36192919 0.8284163
## dis rad tax ptratio black lstat
## low 0.8813182 -0.6941744 -0.7535119 -0.4525638 0.3877226 -0.773536443
## med_low 0.4133939 -0.5454537 -0.5029782 -0.1193521 0.3186970 -0.181994605
## med_high -0.3465323 -0.4166453 -0.3133184 -0.2416600 0.0794859 -0.002996274
## high -0.8460977 1.6373367 1.5134896 0.7798552 -0.7538385 0.981903700
## medv
## low 0.5701753
## med_low 0.0230459
## med_high 0.1597410
## high -0.7823170
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.14221981 0.75694921 -1.03551155
## indus 0.06035719 -0.38846965 0.19552418
## chas -0.02689756 -0.06546595 0.02027996
## nox 0.41459966 -0.78433585 -1.27624502
## rm 0.07993209 0.02517414 -0.11220983
## age 0.23746099 -0.21674923 -0.16300360
## dis -0.09998353 -0.32148250 0.27653895
## rad 3.38172538 0.88802898 -0.05490261
## tax -0.05668572 0.11292811 0.63294851
## ptratio 0.18570820 -0.08760605 -0.40273190
## black -0.07074568 0.02259462 0.11478631
## lstat 0.18189412 -0.14025381 0.30687598
## medv 0.01899375 -0.47627753 -0.34649881
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9522 0.0358 0.0120
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "navy", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, col = classes, pch = classes, dimen = 2)
lda.arrows(lda.fit, myscale = 2)
# save the crime categories from the test set
# then remove the categorical crime variable from the test dataset
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
Next predict the classes with the LDA model on the test data and cross tabulate the results with the crime categories from the test set.
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 14 10 4 0
## med_low 2 12 7 0
## med_high 0 4 18 2
## high 0 0 0 29
The LDA model has predicted almost all the observations belonging to the high class correctly. On the lower end, there are few observations that have been predicted to the med_low instead of the correct low class. There is more variability in the predictions for the medium classes. The prediction for med_low has wrongly predicted for low class almost as much observations. So the LDA model has more difficulties in predicting the observations belonging in the middle than on the tails.
Reload the Boston dataset and standardize the dataset:
library(MASS)
data("Boston")
# use the scale function
boston_scaled2 <- as.data.frame(scale(Boston))
boston_scaled2$crim <- as.numeric(boston_scaled2$crim)
Calculate the distances between the observations
# euclidean distance matrix
dist_eu <- dist(boston_scaled2)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled2, method = "manhattan")
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again:
set.seed(17)
# k-means clustering
km <- kmeans(boston_scaled2, centers = 3)
# plot the Boston dataset with clusters
pairs(boston_scaled2, col = km$cluster)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# optimal number of clusters seems to be 2, as there is a large drop at that point.
# k-means clustering with two clusters
km <- kmeans(boston_scaled2, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled2, col = c("steelblue3", "sienna3")[km$cluster])
Looking at the pairs function it appears that one of the clusters has always absorbed the observations on one end or direction, so that two clusters looks quite appropriate when comparing the plots of all the variables against each other.
library(MASS)
data("Boston")
set.seed(19)
# use the scale function to standardize the data
boston_scaled2 <- as.data.frame(scale(Boston))
boston_scaled2$crim <- as.numeric(boston_scaled2$crim)
# k-means clustering with 4 clusters
km <- kmeans(boston_scaled2, centers = 4)
# linear discriminant analysis
lda.fit2 <- lda(km$cluster ~ ., data = boston_scaled2)
# print the lda.fit object
lda.fit2
## Call:
## lda(km$cluster ~ ., data = boston_scaled2)
##
## Prior probabilities of groups:
## 1 2 3 4
## 0.39920949 0.31818182 0.05731225 0.22529644
##
## Group means:
## crim zn indus chas nox rm
## 1 -0.3554389 -0.4039269 -0.004726028 0.11748284 0.009509773 -0.2448635
## 2 -0.4071151 0.9395565 -0.951448942 -0.12560484 -0.934654735 0.6775631
## 3 3.0022987 -0.4872402 1.014994623 -0.27232907 1.059334474 -1.3064650
## 4 0.4410309 -0.4872402 1.093886784 0.03849464 1.033664373 -0.1906821
## age dis rad tax ptratio black lstat
## 1 0.3085558 -0.225942 -0.5764965 -0.5036322 -0.09996773 0.2544955 0.08303063
## 2 -1.0699238 1.023927 -0.5966711 -0.6791796 -0.53145201 0.3578780 -0.86641188
## 3 0.9805356 -1.048472 1.6596029 1.5294129 0.80577843 -1.1906614 1.87087595
## 4 0.7148590 -0.779002 1.4419986 1.4625319 0.72271650 -0.6534850 0.60056774
## medv
## 1 -0.09324226
## 2 0.72802972
## 3 -1.31020021
## 4 -0.52966703
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## crim 0.35322461 0.307758951 -1.58997833
## zn 0.09849662 -0.330566750 -0.14199716
## indus 0.27720773 -0.031523290 0.36923953
## chas 0.03859298 0.112548411 0.06821859
## nox 0.07110504 -0.230199232 0.17319460
## rm -0.27410931 -0.568297722 0.31638241
## age 0.16778369 0.815075738 0.30710945
## dis -0.28454054 -0.606949418 -0.07519390
## rad 1.75298697 -0.745321060 0.47791025
## tax 1.03317493 -0.661045609 0.17536769
## ptratio 0.16021535 -0.003247889 0.05098139
## black -0.19237795 0.050874618 0.06252285
## lstat 0.27297507 0.032086428 -0.60051563
## medv 0.09553734 -0.204714316 -0.54447145
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.8386 0.0934 0.0680
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "navy", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(km$cluster)
# plot the lda results
plot(lda.fit2, col = classes, pch = classes, dimen = 2)
lda.arrows(lda.fit2, myscale = 3)
There are 4 distinct clusters that are somewhat separate, but the 4th cluster has some of the observations in the middle of the plot, not clearly belonging to any cluster. Clusters 1 and 2 are close to each other as are clusters 3 and 4.
It appears that rad, tax, age, dis and rm have the highest influence in separating the clusters. Looking at the arrows, tax and rad seem to be explaining more for cluster 4, and rm and dis more for cluster 2 for example.
# run the given code
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
## Warning: package 'plotly' was built under R version 4.2.2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
#first plot
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
# target classes as numeric
classes <- as.numeric(train$crime)
# color by classes
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = classes)
# k-means clustering
km <- kmeans(model_predictors, centers = 4)
# target classes as numeric
classes2 <- as.numeric(km$cluster)
# color by km clusters
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = classes2)
All the plots have the same observations with most of the observations in a bigger groups and then a separate smaller group. The second plot that has the colors determined by crime classes has the separate smaller group belonging to one crime class. The third plot with the clusters has also the same smaller group in one cluster, but the lines between the clusters change a bit compared to the observations belonging to the crime classes.
# read in the created dataset
human <- read.table(file.path(".", "data", "human.txt"))
# summaries of the variables
summary(human)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
# visualize the "human" variables
library(GGally)
ggpairs(human)
# compute the correlation matrix and visualize it with corrplot
library(corrplot)
cor(human) %>% corrplot()
First looking at the summaries of the variables, the scale varies a lot depending on the variable. The Gross national income variable, GNI, has the largest scale, and on the other hand the relation of female over male in secondary education (Edu2.FM) and labor force participation (Labo.FM) have the smallest scale.
The distributions differ somewhat as well. The Edu2.FM and Edu.Exp variables have a more normal distribution, compared to GNI, Mat.Mor and Ado.Birth, which have most of the observations at the left tail. Rest of the variables are slightly skewed.
All of the variables have some statistically significant correlations with each other, which is necessary for the principal component analysis. Variables Parli.F and Labo.FM correlate the least with any other variable. Mat.Mor and Ado.Birth correlate negatively with the other variables, when the correlation is significant. Rest of the variables correlate mostly positively with each other.
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)
summary(pca_human)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
# how much variance each dimension explains
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.2.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_eig(pca_human, main = "Scree plot: raw data")
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("darkslategray4", "darkslategrey"), sub = "Large variance of gross national income weighs against all else")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
# standardize the variables
human_std <- scale(human)
# print out summaries of the standardized variables
summary(human_std)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7188 Min. :-2.7378
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6425 1st Qu.:-0.6782
## Median : 0.3503 Median : 0.2316 Median : 0.3056 Median : 0.1140
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.6717 3rd Qu.: 0.7126
## Max. : 2.6646 Max. : 1.6632 Max. : 1.4218 Max. : 2.4730
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
# perform principal component analysis (with the SVD method)
pca_human_std <- prcomp(human_std)
summary(pca_human_std)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
# proportion of variance explained
library(factoextra)
fviz_eig(pca_human_std, main = "Scree plot: standardized data")
# draw a biplot of the principal component representation and the original variables
biplot(pca_human_std, choices = 1:2, cex = c(0.8, 1), col = c("darkslategray4", "darkslategrey"), ylab = "PC2 \n Femle participation", xlab = "PC1 \n Human development vs maternal and reproductive inequalities")
First looking at the standardized variables where all of them have a mean 0, GNI still has the largest maximum value, but now the large standard error is standardized.
Looking at the raw data, the first principal component (PC1) captures 0.99% of the variance and the PC2 0.0001% of the variance. All the variance is captured by the first two PCs, principally only by the PC1. Comparing this to the standardized data where the first principal component captures 0.5361% of the variance and the PC2 0.1624%, it is not dominated by just one PC. Cumulatively these two capture almost 70% of the variance. The standard deviations are quite small, 2 for the PC1, compared to the raw data case, where he standard deviation for PC1 was very large, 18540, which was almost the exact standard deviation for GNI.
Looking at the biplot of raw data, it is clear that for PC1, the GNI variable has a very large effect compared to all the other variables, and this effect takes over everything else. As noted in the beginning when looking at the summaries, GNI had a very large scale compared to all the variables. With the standardized data, other variables have more weight when the differing scales don’t dominate.
For PC1, there are two groups of variables that have an impact; on the other side the two variables that negatively correlated with others, Ado.Birth and Mat.Mor, and on the other side Edu2.FM, Life.Exp, Edu.Exp and GNI. These 4 correlated positively with each other, so they are all grouped on the left side. This dimension therefore is about human development vs maternal and reproductive inequalities.
for PC2 the two variables that hardly correlated with any other variable, Labo.FM and Parli.F, form the third group. So this dimension is about female participation in government and labor force compared to men.
# read in the data
library(dplyr)
library(tidyr)
library(ggplot2)
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
view(tea)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
# column names to keep in the dataset
keep_columns <- c("breakfast", "tea.time","lunch", "dinner", "evening", "frequency", "home", "work", "tearoom")
# select the 'keep_columns' to create a new dataset
tea_set <- select(tea, one_of(keep_columns))
# look at the summaries and structure of the data
summary(tea_set)
## breakfast tea.time lunch dinner
## breakfast :144 Not.tea time:131 lunch : 44 dinner : 21
## Not.breakfast:156 tea time :169 Not.lunch:256 Not.dinner:279
##
##
## evening frequency home work
## evening :103 +2/day :127 home :291 Not.work:213
## Not.evening:197 1 to 2/week: 44 Not.home: 9 work : 87
## 1/day : 95
## 3 to 6/week: 34
## tearoom
## Not.tearoom:242
## tearoom : 58
##
##
str(tea_set)
## 'data.frame': 300 obs. of 9 variables:
## $ breakfast: Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ frequency: Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
# visualize the dataset
library(ggplot2)
pivot_longer(tea_set, cols = everything()) %>%
ggplot(aes(value)) + facet_wrap("name", scales = "free") +
geom_bar(fill = "deeppink4") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
There are 300 observations and 36 variables in this data set. I chose the following columns: breakfast, tea.time,lunch, dinner and evening to have the different times to drink tea, frequency of the tea drinking, and finally home, work and tearoom, so also few of the places as well where to drink tea. All of these variables are factors, most of them have just two levels (yes, no), but the frequancy variable has 4 levles.Breakfast and tea.time have more evenly divided amount of responses, but dinner, home, lunch and tearoom are quite concentrated on one side.
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.2.2
mca <- MCA(tea_set, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_set, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.204 0.158 0.132 0.121 0.114 0.100 0.089
## % of var. 16.727 12.958 10.807 9.924 9.334 8.181 7.314
## Cumulative % of var. 16.727 29.685 40.492 50.416 59.750 67.931 75.246
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.088 0.085 0.069 0.061
## % of var. 7.188 6.992 5.611 4.963
## Cumulative % of var. 82.434 89.426 95.037 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | 0.234 0.090 0.083 | -0.644 0.874 0.625 | -0.131 0.044
## 2 | 0.234 0.090 0.083 | -0.644 0.874 0.625 | -0.131 0.044
## 3 | 0.170 0.047 0.012 | 0.308 0.200 0.040 | 0.224 0.126
## 4 | 1.000 1.631 0.473 | -0.369 0.286 0.064 | 0.055 0.008
## 5 | -0.132 0.028 0.024 | -0.065 0.009 0.006 | -0.520 0.681
## 6 | 1.000 1.631 0.473 | -0.369 0.286 0.064 | 0.055 0.008
## 7 | -0.070 0.008 0.004 | -0.120 0.030 0.012 | -0.288 0.209
## 8 | 0.307 0.154 0.082 | 0.658 0.910 0.376 | 0.029 0.002
## 9 | -0.311 0.158 0.186 | -0.327 0.225 0.206 | -0.141 0.050
## 10 | -0.394 0.253 0.133 | 0.112 0.027 0.011 | -0.150 0.057
## cos2
## 1 0.026 |
## 2 0.026 |
## 3 0.021 |
## 4 0.001 |
## 5 0.369 |
## 6 0.001 |
## 7 0.067 |
## 8 0.001 |
## 9 0.039 |
## 10 0.019 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## breakfast | -0.604 9.505 0.336 -10.028 | -0.535 9.627 0.264
## Not.breakfast | 0.557 8.773 0.336 10.028 | 0.494 8.886 0.264
## Not.tea time | 0.626 9.290 0.303 9.525 | -0.179 0.979 0.025
## tea time | -0.485 7.201 0.303 -9.525 | 0.139 0.759 0.025
## lunch | -0.897 6.419 0.138 -6.433 | 0.960 9.480 0.158
## Not.lunch | 0.154 1.103 0.138 6.433 | -0.165 1.629 0.158
## dinner | 1.819 12.586 0.249 8.629 | -0.039 0.007 0.000
## Not.dinner | -0.137 0.947 0.249 -8.629 | 0.003 0.001 0.000
## evening | -0.250 1.170 0.033 -3.131 | 0.825 16.378 0.356
## Not.evening | 0.131 0.612 0.033 3.131 | -0.431 8.563 0.356
## v.test Dim.3 ctr cos2 v.test
## breakfast -8.883 | -0.106 0.454 0.010 -1.762 |
## Not.breakfast 8.883 | 0.098 0.419 0.010 1.762 |
## Not.tea time -2.722 | -0.409 6.155 0.130 -6.232 |
## tea time 2.722 | 0.317 4.771 0.130 6.232 |
## lunch 6.881 | -0.707 6.168 0.086 -5.069 |
## Not.lunch -6.881 | 0.122 1.060 0.086 5.069 |
## dinner -0.184 | 0.378 0.842 0.011 1.794 |
## Not.dinner 0.184 | -0.028 0.063 0.011 -1.794 |
## evening 10.310 | -0.335 3.239 0.059 -4.187 |
## Not.evening -10.310 | 0.175 1.694 0.059 4.187 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## breakfast | 0.336 0.264 0.010 |
## tea.time | 0.303 0.025 0.130 |
## lunch | 0.138 0.158 0.086 |
## dinner | 0.249 0.000 0.011 |
## evening | 0.033 0.356 0.059 |
## frequency | 0.427 0.501 0.220 |
## home | 0.052 0.058 0.199 |
## work | 0.124 0.000 0.246 |
## tearoom | 0.177 0.063 0.227 |
# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")
When looking at the summary, the first two dimensions compute about 30% of the variance. What can be seen in the bar plots of contributions (below this text) to the variance for dimensions 1-2, is that variables breakfast and frequency (levels 1 to 2/ week and 1 / day) are at the top. The four levels that had the least to contribute are Not.lunch, Not.work, Not.dinner and home. These are the same variables that had most of the answers from the two levels of a variable. As the dinner level for example had very few people, it is very far from the center in the MCA factor map for the first dimension.
# another useful package
library(factoextra)
# Contributions of rows to dimension 1
s1 <- fviz_contrib(mca, choice = "var", axes = 1, top = 20)
# Contributions of rows to dimension 2
s2 <- fviz_contrib(mca, choice = "var", axes = 2, top = 20)
# combine the two plots
library(patchwork)
s1 | s2
#The total contributions to dimension 1 and 2 are obtained as follow:
fviz_contrib(mca, choice = "var", axes = 1:2, top = 20)
# color by contribution
fviz_mca_var(mca, col.var = "contrib",
gradient.cols = c("springgreen1", "springgreen3", "darkgreen"),
repel = TRUE, # avoid text overlapping (slow)
ggtheme = theme_minimal()
)
I think this last plot is quite demonstrative of the contributions.